Abstract

We study the complexity of the stock market by constructing ε-machines of the
Standard and Poor's 500 index from February 1983 to April 2006 and by measuring
their statistical complexities. We find that both the statistical complexity
and the number of causal states of the constructed ε-machines have decreased
over the last twenty years, and that the average memory length needed to
predict the future optimally has become shorter. These results support the view
that information was delivered to economic agents and reflected in market
prices more rapidly in 2006 than in 1983.
1 Introduction



Financial systems have been an active research field for physicists. This
interdisciplinary research area, called econophysics, has been investigated by
means of various statistical methods, such as the correlation function,
multifractality, minimal spanning trees, minority games, continuous-time random
walks, and spin models [1,2,3,4,5,6,7,8,9]. Recently many empirical time series
from financial markets have become available and have also been investigated by
rescaled range (R/S) analysis to test for the presence of correlations [10], by
detrended fluctuation analysis to detect long-range correlations embedded in
seemingly non-stationary time series [11,12], and so on.

In this paper we adopt computational mechanics (CM) [13,14] to investigate the
complexity of the stock market. CM is based on the early work in information
and computation theory by Shannon, Kolmogorov, and Chaitin [15,16,17]. Despite
its strong functionality, CM has so far been applied only to abstract models
such as cellular automata [18,19] and the Ising spin system [20], or to
empirical data in geomagnetism [21] and in the atmosphere [22]. We believe that
CM enables the complexities and structures of different data sets to be
compared quantitatively and that it directly discovers the intrinsic causal
structure within the data [21]. This approach also shows how to infer a model
of the hidden process that generated the observed behavior.

We examined the tick data of the Standard and Poor's 500 (S&P500) index from
February 1983 to April 2006 by constructing deterministic finite automata
called ε-machines [23] from the financial time series and by calculating the
statistical complexity from the constructed machines. The ε-machine captures
the patterns and regularities in the observations in a way that reflects the
causal structure of the process. With this model in hand, we can extrapolate
beyond the original observations to predict future behavior [14]. The
constructed ε-machine is a step toward the eventual use of such machines in
finding effective patterns embedded in the stock market price index. This is a
novel approach to predicting the next move of the stock market with statistical
probabilities. We also analyze the finding that the complexity of the stock
market has decreased.



2 Principles 

Following Feldman [13] and Shalizi [14], we introduce the basics of the
ε-machine and of the statistical complexity as a complexity measure.

2.1 ε-machine

We consider a stochastic process given by an infinite sequence of discrete
random variables, $\overleftrightarrow{X} = \cdots X_{-1} X_0 X_1 X_2 \cdots$,
where each $X_i$ takes a symbol $x_i$ drawn from a finite set $\mathcal{A}$ of
size $k$. At any time $t$ this sequence of random variables can be divided into
two semi-infinite halves: a history $\overleftarrow{X}_t$ and a future
$\overrightarrow{X}_t$. If the process is conditionally stationary, i.e. if for
all possible future events $F$ the probability
$\Pr(\overrightarrow{X}_t \in F \mid \overleftarrow{X}_t = \overleftarrow{x})$
does not depend on $t$, then we drop the subscript. $\overrightarrow{X}^L$ and
$\overleftarrow{X}^L$ denote the first $L$ variables of $\overrightarrow{X}$
and the last $L$ variables of $\overleftarrow{X}$, respectively.

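A causal state is defined as a set of history events (histories, for short)
that have the same conditional probability distribution over all possible
future events. The function ε maps histories to sets of histories:

$$\epsilon(\overleftarrow{x}) = \{\, \overleftarrow{x}' \mid \Pr(\overrightarrow{X} \in F \mid \overleftarrow{X} = \overleftarrow{x}) = \Pr(\overrightarrow{X} \in F \mid \overleftarrow{X} = \overleftarrow{x}') \ \text{for all } F \,\}. \tag{1}$$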

Each causal state $s_i$ consists of its name $i$, a set of histories
$\epsilon(\overleftarrow{x})$, and the conditional probability distribution
$\Pr(\overrightarrow{X} \in F \mid \overleftarrow{X} = \overleftarrow{x})$,
which is called its "morph." $S$ denotes the corresponding random variable and
$\mathcal{S}$ the set of all causal states. The transition probability is then
defined as the probability of generating a symbol $a \in \mathcal{A}$ when
making the transition from state $s_i$ to state $s_j$:

$$T_{ij}^{(a)} = \Pr(\overleftarrow{X}a \in s_j \mid \overleftarrow{X} \in s_i), \tag{2}$$

where $\overleftarrow{X}a$ is read as the semi-infinite sequence obtained by
concatenating $a \in \mathcal{A}$ onto the end of $\overleftarrow{X}$.
Equivalently,

$$T_{ij}^{(a)} = \Pr(S' = s_j, \overrightarrow{X}^1 = a \mid S = s_i), \tag{3}$$

where $S$ and $S'$ are the random variables for the current causal state and
its successor, respectively. The combination of the function ε, mapping
histories to causal states, with the labeled transition probabilities
$T_{ij}^{(a)}$ is called the ε-machine, which represents a computational model
underlying the given time series. The causal states and transitions of the
ε-machine form a directed graph, so there can be states to which the system
never returns once it has left them. These are called transient states; they
cannot be true causal states and are therefore removed, and the remaining
states are the recurrent states [21]. Once the ε-machine is constructed and the
current causal state is identified, one can optimally predict the future
behavior of the process from the conditional probability distributions, which
is useful in practice, for example, for traders in financial markets.
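
The grouping of histories into causal states can be illustrated with a
deliberately simplified sketch. It is not the reconstruction algorithm used for
the results in this paper: it assumes a binary alphabet, compares only the
next-symbol distribution at one fixed history length, and merges histories with
an arbitrary tolerance `tol` instead of a statistical test. The function names
are hypothetical, and transitions and transient states are not handled.

```python
from collections import defaultdict


def estimate_morphs(symbols, L):
    """Estimate Pr(next symbol = 1 | last L symbols) for a binary string."""
    # history -> [count of next symbol 0, count of next symbol 1]
    counts = defaultdict(lambda: [0, 0])
    for t in range(L, len(symbols)):
        history = symbols[t - L:t]
        counts[history][int(symbols[t])] += 1
    return {h: c[1] / (c[0] + c[1]) for h, c in counts.items()}


def group_histories(morphs, tol=0.05):
    """Greedily merge histories whose estimated morphs differ by less than tol."""
    states = []  # each entry: (representative probability, list of histories)
    for history, p in sorted(morphs.items()):
        for rep, members in states:
            if abs(rep - p) < tol:
                members.append(history)
                break
        else:
            states.append((p, [history]))
    return [members for _, members in states]


# Period-2 series: the two observed histories predict different next symbols,
# so each one ends up in its own state.
period_two = "01" * 50
print(group_histories(estimate_morphs(period_two, L=2)))  # [['01'], ['10']]
```

Greedy merging depends on the order in which histories are visited, which is
one reason a proper reconstruction procedure uses hypothesis tests instead.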

In practical applications the length of histories to be considered is limited
to $L_{\max}$, since the process consists of finitely many consecutive random
variables, i.e. a finite time series. $L_{\max}$ should be large enough to
fully detect the structure embedded in the process. On the other hand, it is
also limited by the total number $N$ of available data points through
$L_{\max} < \log_k N$ [21] for the analysis to be significant.
2.2 Statistical and topological complexities

From the constructed ε-machine, the probability $\Pr(s_i)$ of finding the
system in the $i$-th causal state after the machine has been running infinitely
long can be calculated for each $i$. The components
$T_{ij} = \sum_{a \in \mathcal{A}} T_{ij}^{(a)}$ of the transition matrix $T$
give the probability of a transition from state $s_i$ to state $s_j$. The
$\Pr(s_i)$ are obtained by solving

$$\sum_i \Pr(s_i)\, T_{ij} = \Pr(s_j). \tag{4}$$

The statistical and topological complexities are then defined as

$$C_\mu = -\sum_i \Pr(s_i) \log_2 \Pr(s_i), \tag{5}$$

$$C_0 = \log_2 \|\mathcal{S}\|, \tag{6}$$

where $\|\cdot\|$ denotes the cardinality of a set. $C_\mu$ measures the
minimum amount of historical information required to make optimal forecasts
[21,22]. By definition, the topological complexity is an upper bound of the
statistical complexity, with equality when the distribution is uniform, that
is, when $\Pr(s_i) = 1/\|\mathcal{S}\|$ for all causal states. As the
probability distribution over causal states deviates from uniformity, the
statistical complexity becomes smaller and thus falls further below the
topological complexity.
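
As a minimal sketch of how Eqs. (4)-(6) can be evaluated numerically, the
following pure-Python code approximates the stationary distribution by repeated
averaging and then computes both complexities. The function names and the
two-state transition matrix are hypothetical and chosen only for illustration;
the iteration assumes an irreducible chain.

```python
import math


def stationary_distribution(T, iterations=10_000):
    """Approximate the solution of sum_i Pr(s_i) T[i][j] = Pr(s_j), Eq. (4)."""
    n = len(T)
    p = [1.0 / n] * n
    for _ in range(iterations):
        step = [sum(p[i] * T[i][j] for i in range(n)) for j in range(n)]
        # Averaging with the previous vector avoids oscillation on periodic chains.
        p = [(a + b) / 2 for a, b in zip(p, step)]
    return p


def statistical_complexity(p):
    """Eq. (5): Shannon entropy (in bits) of the causal-state distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)


def topological_complexity(p):
    """Eq. (6): log2 of the number of causal states."""
    return math.log2(len(p))


T = [[0.9, 0.1], [0.5, 0.5]]  # hypothetical two-state transition matrix
p = stationary_distribution(T)
print(p, statistical_complexity(p), topological_complexity(p))
```

Because the stationary distribution of this example is not uniform, the
statistical complexity comes out strictly below the topological complexity of
1 bit, in line with the upper-bound statement above.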

2.3 Simple examples 

Before closing this section, we examine a few of the simplest examples by
constructing their ε-machines. The first is a process that generates the same
symbol forever, such as $\cdots 000 \cdots$. There is only one causal state,
consisting of the single history $\overleftarrow{x} = \cdots 000$ with the
morph $\Pr(0 \mid \overleftarrow{x}) = 1$. If we depict each causal state as a
node, and each transition from state $i$ to state $j$ generating a symbol $a$
with probability $p$ as an arc from node $i$ to node $j$ labeled '$a \mid p$',
then this ε-machine is one node with one arc going back to itself, labeled
'0 | 1'. In this case both complexities of Eqs. (5) and (6) are 0 bits.

The second example is a periodic series with period 2, such as
$\cdots 0101 \cdots$. Two histories are found:
$\overleftarrow{x}_1 = \cdots 101$ and $\overleftarrow{x}_2 = \cdots 010$.
Since $\Pr(0 \mid \overleftarrow{x}_1) = 1$ while
$\Pr(0 \mid \overleftarrow{x}_2) = 0$, each history constitutes its own causal
state. The constructed ε-machine consists of two nodes and two arcs, one going
from each node to the other. In this case both complexities of Eqs. (5) and (6)
are 1 bit.
As the final example, we toss a fair coin, recording H for heads and T for
tails, and obtain a random process such as $\cdots HHTHT \cdots$. There are
infinitely many histories, but all the morphs are the same:
$\Pr(H \mid \overleftarrow{x}) = \Pr(T \mid \overleftarrow{x}) = 1/2$.
Therefore a single causal state suffices for the ε-machine, which consists of
one node and two arcs going back to itself with the labels 'H | 1/2' and
'T | 1/2', respectively. Both complexities of Eqs. (5) and (6) are 0 bits,
which means that a totally random process is not complex at all. Thus, these
measures satisfy the so-called 'boundary condition' for a complexity measure,
namely that it vanishes in both the extremely ordered and the extremely
disordered limits [25].
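
The values quoted for these three examples can be checked by hand from
Eqs. (5) and (6). Writing $C_\mu$ for the statistical and $C_0$ for the
topological complexity, and using the fact that the two states of the period-2
machine are visited equally often:

```latex
\begin{align*}
\text{one causal state (constant series, fair coin):}\quad
  & C_\mu = -1 \cdot \log_2 1 = 0, \qquad C_0 = \log_2 1 = 0, \\
\text{two equiprobable causal states (period 2):}\quad
  & C_\mu = -2 \cdot \tfrac{1}{2} \log_2 \tfrac{1}{2} = 1, \qquad C_0 = \log_2 2 = 1.
\end{align*}
```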



3 Empirical data analysis 

For the tick-by-tick S&P500 index data from February 1983 to April 2006, shown
in Fig. 1, the statistical and topological complexities are calculated from the
constructed ε-machines [26]. Using a time window of one year and shifting the
window by one month, we obtain 267 data sets and construct an ε-machine for
each. For convenience each data set is named after its starting month; for
example, the data set covering one year from February 1983 is called the
February 1983 data. The average number of data points per minute varies from
about 1 in the early 1980s to 4 in recent years. For the analysis we first take
the alphabet $\mathcal{A}$ to be the smallest possible set, of size $k = 2$,
namely $\{0, 1\}$. The original index data $Y_n$ are then converted into the
binary time series $F_n$ by the following process:

$$F_n = \theta(Y_{n+1} - Y_n), \tag{7}$$

where $\theta(x)$ is the Heaviside step function. $F_n$ takes the value 0 if
the next index has decreased and the value 1 otherwise. Since the New York
Stock Exchange is open only during the daytime, we use only intra-day data to
avoid the discontinuous jump between the previous day's closing index and the
next day's opening index caused by overnight effects. In other words, we
exclude the $F_n$ corresponding to the difference between the last index of
one day and the first index of the next day.
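
The binarization of Eq. (7) together with the exclusion of overnight
differences can be sketched as follows. The input layout, a chronologically
ordered list of `(day, value)` pairs, and the function name are assumptions
made for this illustration; the sample ticks are invented.

```python
def binarize_intraday(ticks):
    """Map ticks to 0 (index decreased) or 1 (otherwise), skipping overnight gaps.

    `ticks` is a chronologically ordered list of (day, index_value) pairs
    (a hypothetical layout, not the format of the original data).
    """
    symbols = []
    for (day_a, y_a), (day_b, y_b) in zip(ticks, ticks[1:]):
        if day_a != day_b:
            continue  # drop the difference between one day's close and the next open
        symbols.append(0 if y_b < y_a else 1)
    return symbols


ticks = [("d1", 100.0), ("d1", 100.5), ("d1", 100.5), ("d1", 100.2),
         ("d2", 101.0), ("d2", 100.9)]
print(binarize_intraday(ticks))  # [1, 1, 0, 0]
```

Note that an unchanged index is mapped to 1, following the "otherwise" branch
of the definition above.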

To construct the ε-machines we set $L_{\max}$ to 6, which gives the most
reliable results. Figures 2 and 3 depict the ε-machines for the February 1983
data and for the April 2005 data, respectively. It is noteworthy that the
number of causal states has decreased over the last twenty years, which is
discussed later. In these figures, as mentioned above, each numbered node
represents a causal state, and each arc joining one node to another represents
the transition from one causal state to another. Each arc is labeled
'$a \mid p$', meaning that symbol $a$ is generated with probability $p$ by that
transition. For example, in Fig. 3, if the current state of the system is the
0th one, the system returns to the 0th state by generating symbol 1 with
probability $p = 0.634742$, or makes a transition to the 1st state by
generating symbol 0 with probability $1 - p$. The directions of the arcs tell
us which causal state follows.

In more detail, we investigate the histories belonging to each causal state.
The histories of each causal state for the two ε-machines mentioned above are
shown in Tables 1 and 2, respectively. Since we set the length of the longest
histories to be considered to 6, there are $2^6 = 64$ possible histories of
length 6. In particular, the histories of the 0th causal state in Fig. 3 can be
found in the first row of Table 2. If we append symbol 1 to the right end of
each history and limit the length to 6 (e.g., 000011 → 000111), all the
resulting histories remain in the 0th causal state. In the opposite case of
appending symbol 0 to the right end (e.g., 000011 → 000110), all the resulting
histories are found in the 1st causal state. By applying this method
repeatedly, we can predict any finite number of subsequent symbols with a
certain probability, which is obtained simply by multiplying the transition
probabilities along the arcs.
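
Multiplying transition probabilities along the arcs can be sketched with a
hypothetical four-state machine whose states are labeled by the last two
symbols of the history. Only the value 0.634742 for the state ending in 11 is
taken from the text; the other three probabilities are placeholders, and the
names `P_ONE` and `sequence_probability` are introduced here for illustration.

```python
# Probability of generating symbol 1 from each state (state = last two symbols).
# Only the entry for "11" (0.634742, Fig. 3) comes from the text; the remaining
# values are placeholders, not results from the paper.
P_ONE = {"11": 0.634742, "10": 0.5, "01": 0.5, "00": 0.4}


def sequence_probability(state, future):
    """Multiply the transition probabilities along the arcs generating `future`."""
    prob = 1.0
    for symbol in future:
        p_one = P_ONE[state]
        prob *= p_one if symbol == "1" else 1.0 - p_one
        state = state[1] + symbol  # shift the window, as in 000011 -> 000111
    return prob


# Two consecutive 1's from the state ending in 11: 0.634742 * 0.634742.
print(sequence_probability("11", "11"))
```

Summing `sequence_probability` over all futures of a fixed length gives 1,
which is a quick consistency check on any such transition table.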

Next, we found that both the statistical and topological complexities of the
S&P500 index tend to decrease over time, as shown in Fig. 4. Since the time
window is one year, we assume that short-term events in the stock market, such
as Black Monday, do not affect our analysis, and we therefore focus on the
long-term behavior of both complexities. Since the difference between the
statistical and topological complexities is not significant over the whole
time range, the probability distributions over causal states are almost uniform
throughout. Consequently, our main concern reduces to the decrease in the
number of causal states. Specifically, the total number of causal states
decreases from 42 for the February 1983 data to 4 for the April 2005 data.
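
For reference, inserting these two state counts into the definition of the
topological complexity, Eq. (6), gives the following values (written here as
$C_0$, the first rounded to two decimals):

```latex
\begin{align*}
C_0(\text{February 1983}) &= \log_2 42 \approx 5.39 \text{ bits}, \\
C_0(\text{April 2005})    &= \log_2 4 = 2 \text{ bits}.
\end{align*}
```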

To find the principle underlying the decrease in the number of causal states
over time, we revisit Tables 1 and 2, which show the histories of length
$L = 6$ in the causal states for the February 1983 and April 2005 data,
respectively. In Table 1, 62 histories are mapped to 42 causal states according
to their morphs (the other 2 histories were in the removed transient states),
so each causal state is composed of only one to three histories. For each
causal state with two histories, the two histories are identical except for
their leftmost symbols; that is, $L = 5$ is enough to identify such states. For
the majority of causal states, however, $L = 6$ is necessary. It is therefore
reasonable to conclude that the average memory length needed to predict the
future for the February 1983 data was 6, which corresponds to about 6 minutes.

In Table 2, the causal states for the April 2005 data are much simpler than in
the above case. The 64 histories are grouped into 4 causal states, each
containing exactly 16 histories. Within each causal state all the histories
have their last two symbols in common: 11 for the 0th causal state, 10 for the
1st one, and so on. The first 4 symbols of each history run over all possible
combinations of 0's and 1's. In this case the average memory length needed to
predict the future for the April 2005 data was 2, i.e. about half a minute. In
conclusion, the average memory length needed to predict the future has
decreased from 6 in the early 1980s to 2 in recent years.

We can now explain the decreasing tendency of the statistical and topological
complexities over the last twenty years. We call the part of the histories
common to each causal state an 'effective pattern.' If the length of the
effective pattern, $L_{\mathrm{eff}}$, decreases, the number of possible
effective patterns decreases as $2^{L_{\mathrm{eff}}}$, and so does the number
of resulting causal states. Although we set $L_{\max}$ to 6 for the entire time
range, $L_{\mathrm{eff}}$ decreased from 6 to 2 in recent years. Since only the
effective patterns affect the identification of causal states and the
transitions among them, they contribute to predicting the future and can also
be interpreted as the correlation interval. Therefore, in the early 1980s one
had to look back 6 ticks to predict the next tick index; that is, about 6
consecutive tick indices were correlated. In recent years, on the other hand,
one only needs to look back 2 ticks for the prediction, which means that the
correlation is shorter than before.
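
One way to make the effective pattern length operational is to take it as the
shortest suffix length that already determines the causal state of every
history. This reading, and the function name below, are assumptions made for
this sketch. The example partition is synthetic: it groups all 64 histories of
length 6 by their last two symbols, mimicking the structure described for the
April 2005 data rather than reproducing Table 2.

```python
from itertools import product


def effective_length(states):
    """Shortest suffix length that determines the state of every history.

    `states` is a list of lists of equal-length history strings, with each
    history appearing in exactly one state.
    """
    label = {h: i for i, members in enumerate(states) for h in members}
    L = len(next(iter(label)))
    for n in range(L + 1):
        seen = {}
        # setdefault returns the state already recorded for this suffix, if any.
        if all(seen.setdefault(h[L - n:], s) == s for h, s in label.items()):
            return n  # n = L always succeeds, since full histories are unique


# Synthetic partition: 64 histories of length 6 grouped by their last two symbols.
histories = ["".join(bits) for bits in product("01", repeat=6)]
by_suffix = {}
for h in histories:
    by_suffix.setdefault(h[-2:], []).append(h)
states_2005_like = list(by_suffix.values())

print(len(states_2005_like), effective_length(states_2005_like))  # 4 2
```

With four states and an effective length of 2, the count $2^2 = 4$ matches the
scaling of the number of causal states with the effective pattern length.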

The correlation interval is closely related to the time scale on which new
information is delivered to economic agents and reflected in market prices
[27,28]. The decrease of the correlation interval over the last twenty years
supports the view that information flows faster than before and that the
memory length for optimal prediction has become smaller.



4 Conclusions 



In this paper, we investigated the S&P500 index from February 1983 to April
2006 by constructing ε-machines to infer the hidden causal structures embedded
in the data and by measuring the statistical and topological complexities of
the ε-machines. If the current causal state is identified in the constructed
causal structure, then by following the path from state to state one can
predict the future behavior over a finite interval. This would be useful in
practice, for example, for traders in financial markets.

We also found that the statistical complexity and the number of causal states
of the constructed ε-machines have decreased over the last twenty years.
Specifically, the length of the effective patterns in histories has become
shorter in recent years than in the early 1980s. These results imply that
information flows faster and hence that the memory length needed to predict the
future optimally has become shorter.
Table 1

The causal states of the ε-machine constructed from the February 1983 data.
Each causal state consists of its name, histories, and morph.
Table 2

The causal states of the ε-machine constructed from the April 2005 data.
Fig. 2. The ε-machine constructed from the February 1983 data of the S&P500
index. The figure was produced with the Graphviz software
(http://www.graphviz.org/).
Fig. 4. The decrease of the statistical complexity $C_\mu$ and the topological
complexity $C_0$ of the S&P500 index from February 1983 to April 2006
(horizontal axis: year). Note that $C_0$ is an upper bound of $C_\mu$.